Abstract

A restless multi-armed bandit problem that arises in multichannel opportunistic communications is considered,
where channels are modeled as independent and identical Gilbert-Elliot channels and channel state observations
are subject to errors. A simple structure of the myopic policy is established under a certain condition on the false
alarm probability of the channel state detector. It is shown that the myopic policy has a semi-universal structure that
reduces channel selection to a simple round-robin procedure and obviates the need to know the underlying Markov
transition probabilities. The optimality of the myopic policy is proved for the case of two channels and conjectured
for the general case based on numerical examples.

Index Terms: Myopic policy, opportunistic access, restless multi-armed bandit, cognitive radio. 


I. Introduction 

We consider the following stochastic control problem that arises in multichannel opportunistic communications.
Assume that there are $N$ independent and stochastically identical Gilbert-Elliot channels [1]. As
illustrated in Fig. 1, the state of a channel ("good" or "bad") indicates the desirability of accessing this
channel and determines the resulting reward. The transitions between these two states follow a discrete-time
Markov chain with transition probabilities $\{p_{ij}\}_{i,j=0,1}$. This channel model has been commonly used
to abstract physical channels with memory (see [2], [3] and references therein). Consider, for example, the
emerging application of cognitive radios for opportunistic spectrum access where secondary users search
in the spectrum for idle channels temporarily unused by primary users [4]. For this application, the good
state represents an idle channel while the bad state an occupied channel^1.




Fig. 1. The Gilbert-Elliot channel model. 
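The two-state model of Fig. 1 can be illustrated with a short simulation. This is a minimal sketch, not part of the original note: the function names `simulate_channel` and `stationary_good_probability` are hypothetical, and the stationary probability $p_{01}/(p_{01}+p_{10})$ is the standard result for a two-state Markov chain.

```python
import random

def simulate_channel(p01, p11, steps, state=1, seed=0):
    """Simulate one Gilbert-Elliot channel (0 = bad, 1 = good) for `steps` slots."""
    rng = random.Random(seed)
    path = []
    for _ in range(steps):
        path.append(state)
        # The probability of being good in the next slot depends only on the current state.
        p_good = p11 if state == 1 else p01
        state = 1 if rng.random() < p_good else 0
    return path

def stationary_good_probability(p01, p11):
    """Stationary probability of the good state of the two-state chain."""
    p10 = 1.0 - p11
    return p01 / (p01 + p10)
```

With $p_{11} > p_{01}$ the simulated path shows long runs of the same state (positive correlation); with $p_{11} < p_{01}$ the state tends to alternate.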



In each time slot, a user chooses one of the $N$ channels to sense and subsequently access if the chosen
channel is sensed to be in the good state. Sensing is subject to errors: a good channel may be sensed as bad 
and vice versa. Accessing a good channel results in a unit reward, and no access or accessing a bad channel 
leads to zero reward. The design objective is the optimal sensing policy for channel selection in order to 
maximize the expected long-term reward. This problem can be formulated as a partially observable Markov 
decision process (POMDP) for generally correlated channels, or a restless multi-armed bandit process for 
independent channels. 

It has been shown in [5] that obtaining the optimal policy for a general restless multi-armed bandit 
problem is PSPACE-hard. For special classes of restless bandit processes, however, simple structural policies 
may exist that achieve optimality with low complexity. As shown in this paper, for the multichannel
opportunistic access problem stated above, the myopic policy has a simple and robust
structure that reduces channel selection to a simple round-robin procedure when the false alarm probability
of the channel state detector is below a certain value. This structure reveals that the myopic policy does
not require the knowledge of the transition probabilities of the Markovian model except the order of $p_{11}$
and $p_{01}$. The myopic policy thus automatically tracks variations in the channel model provided that the
order of $p_{11}$ and $p_{01}$ remains unchanged. Furthermore, exploiting this simple structure, we prove that the
myopic policy is optimal for $N = 2$. Numerical examples^2 suggest its optimality for general $N$.

This technical note extends our earlier work in [6] that assumes perfect observation of channel states.
As shown in Sections II and III, communication constraints, namely, synchronization in channel selection
between the transmitter and its receiver, require changes in the problem formulation when observations
are imperfect, and uncertainties in the state of sensed channels complicate the proofs for the structure and
optimality of the myopic policy.

^1 When the primary network employs load balancing across channels, the occupancy processes of all channels can be considered stochastically
identical.

^2 Actions given by the myopic policy and the optimal policy are compared numerically for randomly chosen $p_{11}$ and $p_{01}$ and $N = 3$, 4,
and 5. All examples show the equivalence between the myopic policy and the optimal policy.

II. Problem Formulation 

A. System Model 

Let $\mathbf{S}(t) = [S_1(t), \ldots, S_N(t)]$ denote the channel states, where $S_n(t) \in \{0 \text{ (bad)}, 1 \text{ (good)}\}$ is the state
of channel $n$ in slot $t$. At the beginning of each slot, the user first decides which of the $N$ channels to
choose for potential access. Once a channel (say channel $n$) is chosen, the user detects the channel state,
which can be considered as a binary hypothesis test^3:

$$H_0: S_n(t) = 1 \text{ (good)} \quad \text{vs.} \quad H_1: S_n(t) = 0 \text{ (bad)}.$$

The performance of channel state detection is characterized by the probability of false alarm $\epsilon$ and the
probability of miss detection $\delta$:

$$\epsilon = \Pr\{\text{decide } H_1 \mid H_0 \text{ is true}\}, \qquad \delta = \Pr\{\text{decide } H_0 \mid H_1 \text{ is true}\}.$$

For example, in the application of cognitive radios for opportunistic spectrum access, the user can employ 
an energy detector to detect the presence of primary signals. If the measured energy is above a certain 
threshold, the channel is detected as bad (i.e., busy). Otherwise, the channel is considered idle and suitable
for transmission. 

The user transmits over the chosen channel if and only if the channel is detected as in the good state. 
Thus, one of the following four possible events can occur in each slot: (i) the chosen channel is good 
and is correctly detected as such, resulting in a successful transmission; (ii) a false alarm occurs, and 
a communication opportunity is missed; (iii) the chosen channel is bad and is correctly detected; the 
transmitter refrains from transmitting; (iv) a miss detection occurs, resulting in a failed transmission. Only 
in the first event is a unit reward accrued in this slot. The objective is to maximize the average reward
(throughput) over a horizon of $T$ slots by choosing judiciously a sensing policy that governs channel
selection in each slot^4.

^3 We consider here the nontrivial cases with $p_{01}$ and $p_{11}$ in the open interval $(0, 1)$. When they take the special value of 0 or 1, channel
state detection can be simplified. Extensions to such special cases are straightforward.

Since failed transmissions may occur, acknowledgements are necessary to ensure guaranteed delivery. 
Specifically, when the receiver successfully receives a packet (event (i)), it sends an acknowledgement 
to the transmitter at the end of the slot. Otherwise, the receiver does nothing, i.e., a NAK is defined 
as the absence of an ACK, which occurs when the transmitter did not transmit (events (ii) and (iii)) or 
transmitted over a bad channel (event (iv)). We assume that acknowledgements are received without error 
since acknowledgements are always transmitted over a good/idle channel. 
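The four events and the acknowledgement rule above can be written out as a small sketch. The function names are hypothetical and not taken from the original note. One consequence worth noting is that a NAK is certain whenever the sensed channel is bad, regardless of the miss detection probability $\delta$, so only $\epsilon$ enters the statistics of the observation.

```python
def slot_outcome(state, detected_good):
    """Return (transmit, ack, reward) for the sensed channel in one slot.

    state: true channel state (1 = good, 0 = bad).
    detected_good: True if the detector declares the channel good.
    """
    transmit = detected_good                      # access iff detected as good
    ack = 1 if (transmit and state == 1) else 0   # ACK only in event (i)
    reward = ack                                  # unit reward only in event (i)
    return transmit, ack, reward

def ack_probability(omega, eps):
    """Probability of an ACK when the sensed channel is good with probability omega."""
    return omega * (1.0 - eps)

def nak_probability_given_state(state, eps):
    """P(NAK | state): a false alarm if the channel is good, certain if it is bad."""
    return eps if state == 1 else 1.0
```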

B. Value Function and Belief Update 

While the full system state $\mathbf{S}(t) = [S_1(t), \ldots, S_N(t)]$ is not observable, the user can infer the state
from its decision and observation history. A sufficient statistic for optimal decision making is given by the
conditional probability that each channel is in state 1 given all past decisions and observations [8]. Referred
to as the belief vector (or information state), this sufficient statistic is denoted by $\Omega(t) = [\omega_1(t), \ldots, \omega_N(t)]$,
where $\omega_i(t)$ is the conditional probability that $S_i(t) = 1$. In order to ensure that the user and its intended
receiver tune to the same channel in each slot, channel selections should be based on common observations:
the acknowledgement $K(t) \in \{0 \text{ (NAK)}, 1 \text{ (ACK)}\}$ in each slot rather than the detection outcome at the
transmitter. Given the action $a$ and observation $K_a(t) = k$ ($k = 0, 1$), the belief vector in slot $t + 1$ can be
obtained via the Bayes rule:

$$\omega_i(t+1) = \begin{cases} p_{11}, & a = i,\ K_a(t) = 1 \\ \mathcal{T}\!\left(\dfrac{\epsilon\,\omega_i(t)}{\epsilon\,\omega_i(t) + (1 - \omega_i(t))}\right), & a = i,\ K_a(t) = 0 \\ \mathcal{T}(\omega_i(t)), & a \neq i \end{cases} \tag{1}$$

where the operator $\mathcal{T}(\cdot)$ is defined as $\mathcal{T}(x) = x p_{11} + (1 - x) p_{01}$.
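The belief update (1) can be sketched in a few lines. This is an illustrative sketch with hypothetical names (`transition`, `update_belief`) and 0-based channel indices. It assumes $\epsilon > 0$ or $\omega_a < 1$, so that the NAK posterior is well defined.

```python
def transition(x, p01, p11):
    """Operator T(x) = x*p11 + (1 - x)*p01: one-step propagation of a belief value."""
    return x * p11 + (1.0 - x) * p01

def update_belief(omega, a, k, p01, p11, eps):
    """Belief vector for slot t+1 given sensed channel a and observation k (1 = ACK, 0 = NAK)."""
    new = []
    for i, w in enumerate(omega):
        if i != a:
            # Unobserved channel: propagate the belief through the Markov chain.
            new.append(transition(w, p01, p11))
        elif k == 1:
            # ACK: the channel was certainly good in slot t.
            new.append(p11)
        else:
            # NAK: Bayes rule with P(NAK | good) = eps and P(NAK | bad) = 1.
            posterior = eps * w / (eps * w + (1.0 - w))
            new.append(transition(posterior, p01, p11))
    return new
```

For example, with $p_{01} = 0.2$, $p_{11} = 0.8$ and $\epsilon = 0.1$, a NAK on a channel with belief 0.5 first lowers the belief to $0.05/0.55$ and then propagates it through the operator.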

A sensing policy $\pi$ specifies a sequence of functions $\pi = [\pi_1, \pi_2, \ldots, \pi_T]$ where $\pi_t$ maps a belief vector
$\Omega(t)$ to a sensing action $a(t) \in \{1, \ldots, N\}$ for slot $t$. We thus arrive at the following stochastic control
problem:

$$\pi^* = \arg\max_{\pi} E_{\pi}\left[\sum_{t=1}^{T} R_{\pi_t(\Omega(t))}(t) \,\Big|\, \Omega(1)\right], \tag{2}$$

where $R_{\pi_t(\Omega(t))}(t)$ is the reward obtained when the belief is $\Omega(t)$ and channel $a = \pi_t(\Omega(t))$ is selected,
and $\Omega(1)$ is the initial belief vector. This problem falls into the general model of POMDP. It can also be
considered as a restless multi-armed bandit problem by treating the belief value of each channel as the
state of each arm of a bandit.

^4 Note that often the design should be subject to a constraint on the probability of accessing a bad channel, which may cause interference
or waste energy. For example, in the application of cognitive radios for opportunistic spectrum access, transmitting over a bad (busy) channel
leads to a collision with primary users and should be limited below a prescribed level. This constrained stochastic control problem requires
the joint design of the channel state detector (i.e., how to choose the detection threshold to trade off false alarms with miss detections), the
access policy that decides the transmission probability based on imperfect detection outcomes, and the sensing policy for channel selection. It
has been shown in [7] under a general correlated channel model that the optimal detector is the Neyman-Pearson detector with the probability
of miss detection given by the maximum allowable probability of collision, and the optimal access policy is to simply trust the detection
outcome: transmit if and only if the channel is detected as good. The optimal sensing policy can then be designed using this optimal detector
and the optimal access policy without the constraint on accessing a bad channel. This is the problem addressed in this paper.

Let $V_t(\Omega)$ be the value function, which represents the maximum expected remaining reward that can be
accrued starting from slot $t$ when the current belief vector is $\Omega$. We have the following optimality equation.

$$V_T(\Omega) = \max_{a=1,\ldots,N} \omega_a(1 - \epsilon),$$

$$V_t(\Omega) = \max_{a=1,\ldots,N} \left\{ \omega_a(1-\epsilon) + \omega_a(1-\epsilon)\, V_{t+1}(\mathcal{T}(\Omega|a,1)) + (1 - \omega_a(1-\epsilon))\, V_{t+1}(\mathcal{T}(\Omega|a,0)) \right\},$$

where $\mathcal{T}(\Omega|a, i)$ denotes the updated belief vector for slot $t+1$ after incorporating action $a$ and observation
$K(t) = i$ as given in (1).

In theory, the optimal policy n* can be obtained by solving the above dynamic program. Unfortunately, 
this approach is computationally prohibitive due to the impact of the current action on the future reward 
and the uncountable space of the belief vector $\Omega$.
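For very small $N$ and $T$, the dynamic program can still be evaluated by brute-force recursion over all action and observation sequences, which makes the exponential cost concrete. The sketch below uses hypothetical names and 0-based channel indices and assumes $\epsilon > 0$. It is meant only as an illustration of the optimality equation, not as a practical solver.

```python
def transition(x, p01, p11):
    """Operator T(x) = x*p11 + (1 - x)*p01."""
    return x * p11 + (1.0 - x) * p01

def update_belief(omega, a, k, p01, p11, eps):
    """Belief vector for the next slot after sensing channel a and observing k."""
    new = []
    for i, w in enumerate(omega):
        if i != a:
            new.append(transition(w, p01, p11))
        elif k == 1:
            new.append(p11)
        else:
            posterior = eps * w / (eps * w + (1.0 - w))
            new.append(transition(posterior, p01, p11))
    return tuple(new)

def optimal_value(omega, t, horizon, p01, p11, eps):
    """Maximum expected reward from slot t to the horizon (cost grows as (2N)^(T-t))."""
    best = 0.0
    for a, w in enumerate(omega):
        p_ack = w * (1.0 - eps)          # expected immediate reward = P(ACK)
        total = p_ack
        if t < horizon:
            after_ack = update_belief(omega, a, 1, p01, p11, eps)
            after_nak = update_belief(omega, a, 0, p01, p11, eps)
            total += p_ack * optimal_value(after_ack, t + 1, horizon, p01, p11, eps)
            total += (1.0 - p_ack) * optimal_value(after_nak, t + 1, horizon, p01, p11, eps)
        best = max(best, total)
    return best
```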

III. Structure and Optimality of Myopic Policy 
A myopic policy ignores the impact of the current action on the future reward, focusing solely on
maximizing the expected immediate reward $E[R_{a(t)}(t)] = \omega_{a(t)}(t)(1 - \epsilon)$. It is an index policy and is stationary:
the mapping from belief vectors to actions does not change with time $t$. The myopic action $a(t)$ in slot $t$
under belief state $\Omega(t)$ is simply given by

$$a(t) = \arg\max_{a=1,\ldots,N} \omega_a(t)(1 - \epsilon). \tag{3}$$

In general, obtaining the myopic action in each slot requires the recursive update of the belief vector $\Omega(t)$
as given in (1), which requires the knowledge of the transition probabilities $\{p_{ij}\}$. As shown in Theorem 1,
for the problem at hand, the myopic policy has a simple structure that does not need the update of the
belief vector or the knowledge of the transition probabilities.

The basic element in the structure of the myopic policy is a circular ordering $\mathcal{C}$ of the channels. For
a circular order, the starting point is irrelevant: a circular order $\mathcal{C} = (n_1, n_2, \ldots, n_N)$ is equivalent to
$(n_i, n_{i+1}, \ldots, n_N, n_1, n_2, \ldots, n_{i-1})$ for any $1 \le i \le N$.

We now introduce the following notations. For a circular order $\mathcal{C}$, let $-\mathcal{C}$ denote its reverse circular
order, i.e., for $\mathcal{C} = (n_1, n_2, \ldots, n_N)$, we have $-\mathcal{C} = (n_N, n_{N-1}, \ldots, n_1)$. For a channel $i$, let $i^{+}_{\mathcal{C}}$ denote
the next channel in the circular order $\mathcal{C}$. For example, for $\mathcal{C} = (1, 2, \ldots, N)$, we have $i^{+}_{\mathcal{C}} = i + 1$ for
$1 \le i < N$ and $N^{+}_{\mathcal{C}} = 1$.

We present below the structure of the myopic policy. We assume first that the initial belief value $\omega_i(1)$
of each channel is bounded between $p_{01}$ and $p_{11}$. In Appendix B, we show that when this condition on the
initial belief values is violated, the same structure holds for $t > 2$. The only difference is that special care
needs to be given to the second slot. This can be seen from the belief update given in (1). Specifically,
for any initial belief value, the updated belief of each channel (observed or unobserved) in slot $t \ge 2$ is
bounded between $p_{01}$ and $p_{11}$; a belief value outside the interval $[\min\{p_{01}, p_{11}\}, \max\{p_{01}, p_{11}\}]$ can only
occur in the first slot as a given initial state, and is thus referred to as a transient belief state.

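Theorem 1: Structure of Myopic Policy.
Let $\Omega(1) = [\omega_1(1), \ldots, \omega_N(1)]$ denote the initial belief vector. Assume that $\omega_i(1) \in [\min\{p_{01}, p_{11}\}, \max\{p_{01}, p_{11}\}]$
for all $i = 1, 2, \ldots, N$. The circular channel order $\mathcal{C}(1)$ in slot 1 is determined by a descending order
of $\Omega(1)$ (i.e., $\mathcal{C}(1) = (n_1, n_2, \ldots, n_N)$ implies that $\omega_{n_1}(1) \ge \omega_{n_2}(1) \ge \cdots \ge \omega_{n_N}(1)$). Let $a(1) =
\arg\max_{i=1,\ldots,N} \omega_i(1)$. The myopic action $a(t)$ in slot $t$ ($t > 1$) is given as follows.

• Case 1: $p_{11} > p_{01}$ and $\epsilon < \frac{p_{10} p_{01}}{p_{11} p_{00}}$

$$a(t) = \begin{cases} a(t-1), & \text{if } K_{a(t-1)}(t-1) = 1 \\ a(t-1)^{+}_{\mathcal{C}(t)}, & \text{if } K_{a(t-1)}(t-1) = 0 \end{cases} \tag{4}$$

where $\mathcal{C}(t) = \mathcal{C}(1)$.

• Case 2: $p_{11} < p_{01}$ and $\epsilon < \frac{p_{00} p_{11}}{p_{01} p_{10}}$

$$a(t) = \begin{cases} a(t-1), & \text{if } K_{a(t-1)}(t-1) = 0 \\ a(t-1)^{+}_{\mathcal{C}(t)}, & \text{if } K_{a(t-1)}(t-1) = 1 \end{cases} \tag{5}$$

where $\mathcal{C}(t) = \mathcal{C}(1)$ when $t$ is odd and $\mathcal{C}(t) = -\mathcal{C}(1)$ when $t$ is even.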

Proof: See Appendix A. ■ 

Theorem 1, along with Appendix B, shows that the basic structure of the myopic policy is a round-robin
scheme based on a circular ordering of the channels. For $p_{11} > p_{01}$ (which corresponds to a positive
correlation between the channel states in two consecutive slots), the circular order is constant: $\mathcal{C}(t) = \mathcal{C}(1)$
in every slot $t$, where $\mathcal{C}(1)$ is determined by a descending order of the initial belief values. The myopic
action is to stay in the same channel after an ACK and switch to the next channel in the circular order
after a NAK, provided that the false alarm probability $\epsilon$ of the channel state detector is below a certain
value.

For $p_{11} < p_{01}$ (which corresponds to a negative correlation between the channel states in two consecutive
slots), the circular order is reversed in every slot: $\mathcal{C}(t) = \mathcal{C}(1)$ when $t$ is odd and $\mathcal{C}(t) = -\mathcal{C}(1)$ when $t$
is even, where the initial order $\mathcal{C}(1)$ is determined by the initial belief values. The myopic policy stays
in the same channel after a NAK; otherwise, it switches to the next channel in the current circular order
$\mathcal{C}(t)$, which is either $\mathcal{C}(1)$ or $-\mathcal{C}(1)$ depending on whether the current time $t$ is odd or even^5.

This simple structure suggests that the myopic sensing policy is particularly attractive in implementation. 
Besides its simplicity, the myopic policy obviates the need for knowing the channel transition probabilities 
and automatically tracks variations in the channel model. 
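The round-robin structure can be sketched as follows, together with a check that the selected channel always carries the largest belief value. All names are hypothetical and channel indices are 0-based. The check draws ACK/NAK outcomes from the belief-state model (an ACK occurs with probability $\omega_a(1-\epsilon)$) rather than from simulated channel states. Note that `myopic_round_robin` uses only the order of $p_{11}$ and $p_{01}$, while the belief update is needed only for the check.

```python
import random

def transition(x, p01, p11):
    return x * p11 + (1.0 - x) * p01

def update_belief(omega, a, k, p01, p11, eps):
    new = []
    for i, w in enumerate(omega):
        if i != a:
            new.append(transition(w, p01, p11))
        elif k == 1:
            new.append(p11)
        else:
            new.append(transition(eps * w / (eps * w + (1.0 - w)), p01, p11))
    return new

def next_in_order(order, channel):
    """Next channel after `channel` in the circular order `order`."""
    return order[(order.index(channel) + 1) % len(order)]

def myopic_round_robin(order1, prev_action, prev_ack, t, positively_correlated):
    """Myopic action for slot t >= 2, following the stay/switch rules of Theorem 1."""
    if positively_correlated:
        # Case 1: fixed order; stay after ACK, switch after NAK.
        return prev_action if prev_ack == 1 else next_in_order(order1, prev_action)
    # Case 2: order reversed in even slots; stay after NAK, switch after ACK.
    order = order1 if t % 2 == 1 else order1[::-1]
    return prev_action if prev_ack == 0 else next_in_order(order, prev_action)

def check_structure(omega, p01, p11, eps, horizon, seed=0):
    """Return True if the round-robin action has the largest belief in every slot."""
    rng = random.Random(seed)
    order1 = sorted(range(len(omega)), key=lambda i: -omega[i])  # descending beliefs
    a = order1[0]
    for t in range(1, horizon + 1):
        if abs(omega[a] - max(omega)) > 1e-12:
            return False
        ack = 1 if rng.random() < omega[a] * (1.0 - eps) else 0
        omega = update_belief(omega, a, ack, p01, p11, eps)
        a = myopic_round_robin(order1, a, ack, t + 1, p11 > p01)
    return True
```

The illustrative parameters used to exercise this sketch ($p_{01} = 0.2$, $p_{11} = 0.8$ or the reverse, $\epsilon = 0.05$) satisfy the false alarm conditions of Theorem 1, whose threshold evaluates to 0.0625 in both cases.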

We point out that the structure of the myopic sensing policy in the presence of sensing errors is similar 
to that under perfect sensing given in [6]. The proof, however, is more involved since the observations 
here are acknowledgements and the state of the sensed channel cannot be inferred with certainty from a 
NAK. 

Theorem 2 below shows that the myopic sensing policy with such a simple and robust structure is, in
fact, optimal for $N = 2$.

Theorem 2: Optimality of Myopic Policy.
For $N = 2$, the myopic policy is optimal when $\epsilon < \frac{p_{10} p_{01}}{p_{11} p_{00}}$ for positively correlated channels ($p_{11} > p_{01}$)
and $\epsilon < \frac{p_{00} p_{11}}{p_{01} p_{10}}$ for negatively correlated channels ($p_{11} < p_{01}$) when the initial belief values are bounded^6
between $p_{01}$ and $p_{11}$.

Proof: See Appendix C. ■
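A numerical sanity check of Theorem 2 for $N = 2$ can be sketched by comparing the value of the myopic policy with the optimal value from brute-force recursion over a short horizon. The function names, the 0-based indexing, and the parameter values in the usage note are illustrative assumptions and are not the numerical experiments reported in the note.

```python
def transition(x, p01, p11):
    return x * p11 + (1.0 - x) * p01

def update_belief(omega, a, k, p01, p11, eps):
    new = []
    for i, w in enumerate(omega):
        if i != a:
            new.append(transition(w, p01, p11))
        elif k == 1:
            new.append(p11)
        else:
            new.append(transition(eps * w / (eps * w + (1.0 - w)), p01, p11))
    return tuple(new)

def value(omega, t, horizon, p01, p11, eps, myopic):
    """Expected total reward from slot t: myopic policy if `myopic`, else optimal."""
    if myopic:
        # Myopic policy: only the channel with the largest belief is considered.
        candidates = [max(range(len(omega)), key=lambda i: omega[i])]
    else:
        candidates = range(len(omega))
    best = 0.0
    for a in candidates:
        p_ack = omega[a] * (1.0 - eps)
        total = p_ack
        if t < horizon:
            for k, p_k in ((1, p_ack), (0, 1.0 - p_ack)):
                nxt = update_belief(omega, a, k, p01, p11, eps)
                total += p_k * value(nxt, t + 1, horizon, p01, p11, eps, myopic)
        best = max(best, total)
    return best
```

Calling `value((0.6, 0.3), 1, 6, 0.2, 0.8, 0.05, False)` and the same call with `myopic=True` should agree up to floating-point error if the conditions of Theorem 2 hold for these parameters, and likewise with $p_{01}$ and $p_{11}$ swapped.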
Numerical examples suggest that there exist similar conditions for all $N$ under which the myopic policy
is optimal. Proving this conjecture turns out to be challenging. A recent work [9] has made progress
towards proving a corresponding conjecture under the assumption of perfect sensing, by showing that the
optimality holds for $N > 2$ under the condition that $p_{11} > p_{01}$. Furthermore, it is shown in [9] that if the
myopic policy is optimal under the sum-reward criterion over a finite horizon, it is also optimal for other
criteria such as discounted and averaged rewards over a finite or infinite horizon. These results may be
extended to the case with noisy observations, since the optimality proof given in [9] exploits the simple
structure of the myopic policy, which, as shown here, also holds with noisy observations.

^5 An alternative way to see the channel switching structure of the myopic policy is through the last visit to each channel (once every channel
has been visited at least once). Specifically, for $p_{11} > p_{01}$, when a channel switch is needed, the policy selects the channel visited the longest
time ago. For $p_{11} < p_{01}$, when a channel switch is needed, the policy selects, among those channels to which the last visit occurred an even
number of slots ago, the one most recently visited. If there are no such channels, the user chooses the channel visited the longest time ago.

^6 Recall that a belief value outside the interval $[\min\{p_{01}, p_{11}\}, \max\{p_{01}, p_{11}\}]$ is transient. For any initial state, the belief values in slots
$t \ge 2$ are bounded between $p_{01}$ and $p_{11}$. As a consequence, Theorem 2 shows that when one or more of the initial belief values are transient,
the myopic policy still provides the optimal actions in all slots except maybe the first slot.

Both the structure and the optimality of the myopic policy require a certain level of reliability of the 
channel state detector. When this level of reliability is not met, the simple structure of the myopic policy 
may no longer hold, and the myopic actions need to be obtained from (3) and the recursive belief update in
(1). The optimality of the myopic policy may also be lost in this case. A more complex policy, for example,
Whittle's index policy [11], may need to be sought after to achieve better performance. This brings out 
an interesting tradeoff between the complexity of the detector at the physical layer and the complexity of 
the sensing strategy at the Medium Access Control (MAC) layer. In particular, the reliability of a detector 
(for example, an energy detector) can always be improved by increasing the sensing time so that a simple 
and optimal policy — the myopic policy — can be employed. The caveat is the reduced transmission time 
for a given slot length. Such a tradeoff can be complex and is beyond the scope of this technical note. 

IV. Conclusion and Discussions 

We have established a simple structure of the myopic policy for channel selection in an $N$-channel
opportunistic communication system under an i.i.d. Gilbert-Elliot channel model. The optimality of this 
simple myopic policy is proved for N = 2 and conjectured for N > 2. This is a non-trivial extension of 
our previous results pertaining to the case of error-free channel state detection [6], as noisy observations 
make it challenging to maintain synchronous channel selection between the transmitter and its receiver. 
This communication constraint adds an interesting twist to the resulting stochastic control problem. 

The optimality of the myopic policy in the context of opportunistic communications may bear significance 
in the general context of restless multi-armed bandit processes. While the classical bandit problems can be 
solved optimally using the Gittins Index [10], restless bandit problems are known to be PSPACE-hard in 
general [5]. Whittle proposed a Gittins-like indexing heuristic for restless bandit problems [11], which
is shown to be asymptotically optimal in a certain limiting regime [12]. Beyond this asymptotic result,
relatively little is known about the structure of the optimal policies for a general restless bandit process.
The optimality of the myopic policy shown in this paper and [6] suggests non-asymptotic conditions under
which an index policy with a semi-universal structure can actually be optimal for restless bandit processes. 

Approximation algorithms for restless bandit problems have also been explored in the literature. In [13], 
Guha and Munagala have developed a constant-factor (1/68) approximation via LP relaxation for the same
class of restless bandit processes as considered in this paper. The difference is that the model in [13]
allows for non-identical channels but every channel is positively correlated. We point out that negatively 
correlated processes are significantly harder to deal with due to the loss of monotonicity in the belief 
updates (see [6]). In [14], Guha et al. have developed a factor 2 approximation policy for another class of 
restless bandit problems (referred to as monotone bandits) via LP relaxation. Raghunathan et al. [15] have 
also modeled multicast scheduling in broadcast wireless LANs as a restless bandit problem and provided 
a closed-form bound for the performance of Whittle's index policy with respect to the optimal. 

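Appendix A: Proof of Theorem 1

We prove Theorem 1 by showing that the channel $a(t)$ given by (4) and (5) is indeed the channel with the largest belief
value in slot $t$. Specifically, we prove the following lemma.

Lemma 1: Let $a(t) = i_1$ be the channel determined by (4) for $p_{11} > p_{01}$ and by (5) for $p_{11} < p_{01}$. Let $\mathcal{C}(t) = (i_1, i_2, \ldots, i_N)$
be the circular order of channels in slot $t$, where we set the starting point to $a(t) = i_1$. We then have, for any $t \ge 1$,

$$\omega_{i_1}(t) \ge \omega_{i_2}(t) \ge \cdots \ge \omega_{i_N}(t), \tag{6}$$

i.e., the channel given by (4) and (5) has the largest belief value in every slot $t$.

To prove Lemma 1, we note the following properties of the operator $\mathcal{T}(x)$ defined in (1).
P1. $\mathcal{T}(x)$ is an increasing function for $p_{11} > p_{01}$ and a decreasing function for $p_{11} < p_{01}$.

P2. $\forall\, 0 \le x \le 1$, $p_{01} \le \mathcal{T}(x) \le p_{11}$ for $p_{11} > p_{01}$ and $p_{11} \le \mathcal{T}(x) \le p_{01}$ for $p_{11} < p_{01}$.

P3. For $p_{11} > p_{01}$ and $\epsilon < \frac{p_{10} p_{01}}{p_{11} p_{00}}$, we have $\mathcal{T}\!\left(\frac{\epsilon\omega}{\epsilon\omega + (1-\omega)}\right) \le \mathcal{T}(\omega')$ $\forall\, p_{01} \le \omega, \omega' \le p_{11}$; for $p_{11} < p_{01}$ and $\epsilon < \frac{p_{00} p_{11}}{p_{01} p_{10}}$, we
have $\mathcal{T}\!\left(\frac{\epsilon\omega}{\epsilon\omega + (1-\omega)}\right) \ge \mathcal{T}(\omega')$ $\forall\, p_{11} \le \omega, \omega' \le p_{01}$.

P1 and P2 follow directly from the definition of $\mathcal{T}(x)$. To show P3 for $p_{11} > p_{01}$, it suffices to show $\frac{\epsilon\omega}{\epsilon\omega + (1-\omega)} \le p_{01}$ due to
the monotonically increasing property of $\mathcal{T}(x)$ and the bound on $\omega'$. Noticing that $\frac{\epsilon\omega}{\epsilon\omega + (1-\omega)}$ is an increasing function of both
$\omega$ and $\epsilon$, we arrive at P3 by using the upper bounds on $\omega$ and $\epsilon$. Similarly, we can show P3 for $p_{11} < p_{01}$.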
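The argument for P3 can be written out explicitly. The following worked derivation is a sketch under the reading that the two thresholds on the false alarm probability are $\frac{p_{10}p_{01}}{p_{11}p_{00}}$ (for $p_{11} > p_{01}$) and $\frac{p_{00}p_{11}}{p_{01}p_{10}}$ (for $p_{11} < p_{01}$); the shorthand $g(\omega, \epsilon)$ is introduced here only for convenience.

```latex
% Posterior belief after a NAK; increasing in both \omega and \epsilon.
g(\omega,\epsilon) := \frac{\epsilon\omega}{\epsilon\omega + (1-\omega)}

% Case p_{11} > p_{01}: \mathcal{T} is increasing and \omega' \ge p_{01},
% so it suffices that g(\omega,\epsilon) \le p_{01} at the largest \omega = p_{11}.
\frac{\epsilon p_{11}}{\epsilon p_{11} + p_{10}} \le p_{01}
\iff \epsilon\, p_{11}(1 - p_{01}) \le p_{01} p_{10}
\iff \epsilon \le \frac{p_{10}\, p_{01}}{p_{11}\, p_{00}}

% Case p_{11} < p_{01}: \mathcal{T} is decreasing and \omega' \ge p_{11},
% so it suffices that g(\omega,\epsilon) \le p_{11} at the largest \omega = p_{01}.
\frac{\epsilon p_{01}}{\epsilon p_{01} + p_{00}} \le p_{11}
\iff \epsilon\, p_{01}(1 - p_{11}) \le p_{11} p_{00}
\iff \epsilon \le \frac{p_{00}\, p_{11}}{p_{01}\, p_{10}}
```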

We now prove Lemma 1 by induction. For $t = 1$, (6) holds by the definition of $\mathcal{C}(1)$. Assume that (6) is true for slot $t$,
where $\mathcal{C}(t) = (i_1, i_2, \ldots, i_N)$ and $a(t) = i_1$. We show that it is also true for slot $t + 1$.

Consider first $p_{11} > p_{01}$. We have $\mathcal{C}(t+1) = \mathcal{C}(t) = (i_1, i_2, \ldots, i_N)$. When $K_{i_1}(t) = 1$, we have $a(t+1) = a(t) = i_1$
from (4). Since $\omega_{i_1}(t+1) = p_{11}$ achieves the upper bound of the belief values (see P2) and the order of the belief values of
the unobserved channels remains unchanged due to P1, we arrive at (6) for $t+1$. When $K_{i_1}(t) = 0$, we have $a(t+1) = i_2$
from (4). We again have (6) by noticing that $\omega_{i_1}(t+1) = \mathcal{T}\!\left(\frac{\epsilon\,\omega_{i_1}(t)}{\epsilon\,\omega_{i_1}(t) + (1 - \omega_{i_1}(t))}\right)$ is the smallest belief value in slot $t+1$ (see
P3) and $\mathcal{C}(t+1) = (i_2, i_3, \ldots, i_N, i_1)$ when the starting point is set to $a(t+1) = i_2$.

For $p_{11} < p_{01}$, $\mathcal{C}(t+1) = -\mathcal{C}(t) = (i_1, i_N, i_{N-1}, \ldots, i_2)$. When $K_{i_1}(t) = 0$, we have $a(t+1) = a(t) = i_1$ from (5). Since
$\omega_{i_1}(t+1) = \mathcal{T}\!\left(\frac{\epsilon\,\omega_{i_1}(t)}{\epsilon\,\omega_{i_1}(t) + (1 - \omega_{i_1}(t))}\right)$ is the largest belief value in slot $t+1$ (see P3) and the order of the belief values of the unobserved
channels is reversed due to P1, we have, from the induction assumption at $t$,

$$\omega_{i_1}(t+1) \ge \omega_{i_N}(t+1) \ge \omega_{i_{N-1}}(t+1) \ge \cdots \ge \omega_{i_2}(t+1),$$

which agrees with (6) for $t+1$ and $\mathcal{C}(t+1) = (i_1, i_N, i_{N-1}, \ldots, i_2)$. When $K_{i_1}(t) = 1$, we have $a(t+1) = i_N$ from (5). We again
have (6) by noticing that $\omega_{i_1}(t+1) = p_{11}$ achieves the lower bound of the belief values and $\mathcal{C}(t+1) = (i_N, i_{N-1}, \ldots, i_2, i_1)$
when the starting point is set to $a(t+1) = i_N$. This concludes the proof of Lemma 1, hence Theorem 1.

Appendix B: Structure of the Myopic Policy under Transient Initial Belief States 

We now consider when one or more initial belief values are transient, i.e., outside the interval $[\min\{p_{01}, p_{11}\}, \max\{p_{01}, p_{11}\}]$.
Let $\Omega(1) = [\omega_1(1), \ldots, \omega_N(1)]$ denote the initial belief vector. Without loss of generality, assume that $\omega_1(1) \ge \omega_2(1) \ge \cdots \ge
\omega_N(1)$. Thus $a(1) = 1$. Let $r$ denote the rank of $\frac{\epsilon\,\omega_1(1)}{\epsilon\,\omega_1(1) + (1 - \omega_1(1))}$ in $\left\{\frac{\epsilon\,\omega_1(1)}{\epsilon\,\omega_1(1) + (1 - \omega_1(1))}, \omega_2(1), \ldots, \omega_N(1)\right\}$ with $r = 1$ when
$\frac{\epsilon\,\omega_1(1)}{\epsilon\,\omega_1(1) + (1 - \omega_1(1))}$ is the largest and $r = N$ when it is the smallest. When one or more of the initial belief values are transient,
the myopic action $a(t)$ in slot $t$ ($t > 1$) is given as follows.

• Case 1: $p_{11} > p_{01}$ and $\epsilon < \frac{p_{10} p_{01}}{p_{11} p_{00}}$

- If $K_{a(1)}(1) = 1$, the myopic action $a(t)$ ($t > 1$) follows the same structure given by (4) with $\mathcal{C}(1) = (1, 2, \ldots, N)$.

- If $K_{a(1)}(1) = 0$, the myopic action in slot $t = 2$ is $a(2) = 1$ when $r = 1$ and $a(2) = 2$ when $r > 1$. The
myopic action $a(t)$ for $t > 2$ follows the same structure given by (4) with $\mathcal{C}(1) = (1, 2, \ldots, N)$ when $r = 1$ and
$\mathcal{C}(1) = (2, 3, \ldots, r, 1, r+1, r+2, \ldots, N)$ when $r > 1$.

• Case 2: $p_{11} < p_{01}$ and $\epsilon < \frac{p_{00} p_{11}}{p_{01} p_{10}}$

- If $K_{a(1)}(1) = 1$, the myopic action $a(t)$ ($t > 1$) follows the same structure given by (5) with $\mathcal{C}(1) = (1, 2, \ldots, N)$.

- If $K_{a(1)}(1) = 0$, the myopic action in slot $t = 2$ is $a(2) = 1$ when $r = N$ and $a(2) = N$ when $r < N$. The
myopic action $a(t)$ for $t > 2$ follows the same structure given by (5) with $\mathcal{C}(1) = (1, 2, \ldots, N)$ when $r = 1$ and
$\mathcal{C}(1) = (2, 3, \ldots, r, 1, r+1, r+2, \ldots, N)$ when $r > 1$.

The above modification can be easily proved based on P1 and P2 given in Appendix A.

Appendix C: Proof of Theorem 2

Let $\hat{V}_t(\Omega)$ denote the total expected reward obtained under the myopic policy starting from slot $t$, and $\hat{V}_t(\Omega; a)$ the total
expected reward obtained by action $a$ in slot $t$ followed by the myopic policy in future slots. The proof is based on the following
lemma which applies to a general POMDP.

Lemma 2: For a $T$-horizon POMDP, the myopic policy is optimal if for $t = 1, \ldots, T$,

$$\hat{V}_t(\Omega) \ge \hat{V}_t(\Omega; a), \quad \forall\, a, \Omega. \tag{7}$$

Lemma 2 can be proved by reverse induction, where the initial condition of the optimality of the myopic action in the last
slot $T$ is straightforward.

We now prove Theorem 2. Considering all channel state realizations in slot $t$, we have

$$\hat{V}_t(\Omega; a) = (1-\epsilon)\,\omega_a + \sum_{s_1, s_2 \in \{0,1\}} \Pr[\mathbf{S}(t) = [s_1, s_2] \mid \Omega(t)]\; \hat{V}_{t+1}(\mathcal{T}(\Omega(t)|a, s_a) \mid \mathbf{S}(t) = [s_1, s_2]), \tag{8}$$

where $\hat{V}_{t+1}(\mathcal{T}(\Omega(t)|a, s_a) \mid \mathbf{S}(t) = [s_1, s_2])$ is the conditional reward obtained starting from slot $t+1$ given that the system
state in slot $t$ is $[s_1, s_2]$. Next, we establish two lemmas regarding the conditional value function of the myopic policy.

Lemma 3: Under the conditions of Theorem 1, the expected total remaining reward starting from slot $t$ under the myopic
policy is determined by the action $a(t-1)$ and the system state $\mathbf{S}(t-1)$ in slot $t-1$, hence independent of the belief vector
$\Omega(t)$ at the beginning of slot $t$, i.e.,

$$\hat{V}_t(\mathcal{T}(\Omega(t-1)|a, s_a) \mid \mathbf{S}(t-1) = [s_1, s_2]) = \hat{V}_t(\mathcal{T}(\Omega'(t-1)|a, s_a) \mid \mathbf{S}(t-1) = [s_1, s_2]).$$

Adopting the simplified notation of $\hat{V}_t(a(t-1) \mid \mathbf{S}(t-1) = [s_1, s_2])$, we further have

$$\hat{V}_t(a(t-1) = 1 \mid \mathbf{S}(t-1) = [s_1, s_2]) = \hat{V}_t(a(t-1) = 2 \mid \mathbf{S}(t-1) = [s_2, s_1]). \tag{9}$$

Proof: Given $a(t-1)$ and $\mathbf{S}(t-1)$, the myopic actions in slots $t$ to $T$, governed by the structure given in Theorem 1, are
fixed for each sample path of system state and observation, independent of $\Omega(t)$. As a consequence, the total reward obtained
in slots $t$ to $T$ for each sample path is independent of $\Omega(t)$, and so is the expected total reward. (9) follows from the statistically
identical assumption of channels. ■

Lemma 4: Under the conditions of Theorem 1, we have, $\forall\, t, a$,

$$\left| \hat{V}_t(a(t-1) = a \mid \mathbf{S}(t-1) = [1,0]) - \hat{V}_t(a(t-1) = a \mid \mathbf{S}(t-1) = [0,1]) \right| \le (1 - \epsilon). \tag{10}$$

Proof: Based on (9), it suffices to consider $a(t-1) = 1$. We prove for $p_{11} < p_{01}$ by reverse induction. The proof for
$p_{11} > p_{01}$ is similar. The inequality in (10) holds for $t = T$ since $(1 - \epsilon)$ is the maximum expected reward that can be obtained in
one slot. Assume that the inequality holds for $t+1$. We show that it holds for $t$. Consider first $\hat{V}_t(a(t-1) = 1 \mid \mathbf{S}(t-1) = [1,0])$.
With probability $1 - \epsilon$, the user successfully identifies that channel 1 is in the good state in slot $t-1$ and receives an
acknowledgement at the end of slot $t-1$. According to the structure of the myopic policy, the user switches channel in slot $t$,
i.e., $a(t) = 2$. The expected immediate reward in slot $t$ is thus $p_{01}(1 - \epsilon)$ since the state of channel 2 in slot $t-1$ is 0. We
thus arrive at the first term of (11), where $\hat{V}_t(a(t-1) = 1 \mid \mathbf{S}(t-1) = [1,0])$ is given by the summation of $p_{01}(1 - \epsilon)$ and the
future reward starting from slot $t+1$ conditioned on all four possible system states in slot $t$. With probability $\epsilon$, a false alarm
occurs in slot $t-1$, resulting in a NAK. The user thus stays in channel 1 in slot $t$: $a(t) = 1$. We thus arrive at the second
term of (11). Similarly, we obtain $\hat{V}_t(a(t-1) = 1 \mid \mathbf{S}(t-1) = [0,1])$ as given in (12), which follows from the fact that a NAK
occurs in slot $t-1$ due to the given bad state of the chosen channel 1.

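$$\begin{aligned} \hat{V}_t(1|[1,0]) = {} & (1-\epsilon)\left\{ p_{01}(1-\epsilon) + p_{10}p_{00}\hat{V}_{t+1}(2|[0,0]) + p_{11}p_{01}\hat{V}_{t+1}(2|[1,1]) + p_{11}p_{00}\hat{V}_{t+1}(2|[1,0]) + p_{10}p_{01}\hat{V}_{t+1}(2|[0,1]) \right\} \\ & + \epsilon\left\{ p_{11}(1-\epsilon) + p_{10}p_{00}\hat{V}_{t+1}(1|[0,0]) + p_{11}p_{01}\hat{V}_{t+1}(1|[1,1]) + p_{11}p_{00}\hat{V}_{t+1}(1|[1,0]) + p_{10}p_{01}\hat{V}_{t+1}(1|[0,1]) \right\} \end{aligned} \tag{11}$$

$$\hat{V}_t(1|[0,1]) = p_{01}(1-\epsilon) + p_{00}p_{10}\hat{V}_{t+1}(1|[0,0]) + p_{01}p_{11}\hat{V}_{t+1}(1|[1,1]) + p_{01}p_{10}\hat{V}_{t+1}(1|[1,0]) + p_{00}p_{11}\hat{V}_{t+1}(1|[0,1]) \tag{12}$$

Applying (9) and the upper bound on $\epsilon$, and noting that $p_{10}p_{01} - p_{11}p_{00} = p_{01} - p_{11}$, we have

$$\begin{aligned} \left| \hat{V}_t(1|[0,1]) - \hat{V}_t(1|[1,0]) \right| &\le (1-\epsilon)p_{01} - (1-\epsilon)(\epsilon p_{11} + (1-\epsilon)p_{01}) + \epsilon \left| \hat{V}_{t+1}(1|[1,0]) - \hat{V}_{t+1}(1|[0,1]) \right| (p_{10}p_{01} - p_{11}p_{00}) \\ &\le 2(1-\epsilon)\,\epsilon\,(p_{01} - p_{11}) \\ &< 2(1-\epsilon)\,\frac{p_{00}p_{11}}{p_{01}p_{10}}\,(p_{01} - p_{11}) \\ &\le (1-\epsilon), \end{aligned}$$

where the last inequality follows from $(p_{01} - p_{11})\frac{p_{11}}{p_{01}} < \frac{1}{2}$ and $\frac{p_{00}}{p_{10}} < 1$. ■

We now show that (7) in Lemma 2 holds. Consider $\Omega(t) = [\omega_1(t), \omega_2(t)]$ with $\omega_1(t) \ge \omega_2(t)$, i.e., the myopic action in slot
$t$ is $a(t) = 1$. Applying (9) and Lemma 4 to (8), we have

$$\hat{V}_t(\Omega; a = 1) - \hat{V}_t(\Omega; a = 2) = (\omega_1 - \omega_2)\left(1 - \epsilon + \hat{V}_{t+1}(1|[1,0]) - \hat{V}_{t+1}(1|[0,1])\right) \ge 0.$$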